PageRank optimization applied to spam detection 



Olivier Fercoq INRIA Saclay and CMAP Ecole Polytechnique 
olivier.fercoq@inria.fr 




ABSTRACT 

We present a new link spam detection and PageRank demotion
algorithm called MaxRank. Like TrustRank and AntiTrustRank,
it starts with a seed of hand-picked trusted and spam pages.
We define the MaxRank of a page as the frequency with which
this page is visited by a random surfer minimizing an average
cost per time unit. On a given page, the random surfer selects
a set of hyperlinks and clicks with uniform probability on any
of these hyperlinks. The cost function penalizes spam pages and
hyperlink removals. The goal is to determine a hyperlink
deletion policy that minimizes this average cost. The MaxRank
is interpreted as a modified PageRank vector, used to sort web
pages instead of the usual PageRank vector. The bias vector of
this ergodic control problem, which is unique up to an additive
constant, is a measure of the "spamicity" of each page, used to
detect spam pages. We give a scalable algorithm for MaxRank
computation that allowed us to run experiments on the
WEBSPAM-UK2007 dataset. We show that our algorithm
outperforms both TrustRank and AntiTrustRank for spam
and nonspam page detection.

1. INTRODUCTION 

Internet search engines use a variety of algorithms to sort 
web pages based on their text content or on the hyperlink 
structure of the web. In this paper, we focus on algorithms 
that use the latter hyperlink structure, called link-based al- 
gorithms. The basic notion for all these algorithms is the 
web graph, which is a digraph with a node for each web page 
and an arc between pages i and j if there is a hyperlink from 
page i to page j. 

One of the main link-based ranking methods is the PageRank
introduced by Brin and Page [10]. It is defined as the
invariant measure of a walk made by a random surfer on the
web graph. When reading a given page, the surfer either
selects a link from the current page (with uniform
probability) and moves to the page pointed to by that link,
or interrupts his current search and moves to an arbitrary
page, which is selected according to given "teleportation"
probabilities. The rank of a page is defined as the frequency
with which the random surfer visits it. It is interpreted as
the "popularity" of the page.

From the early days of search engines, some webmasters have
tried to get their web pages overranked through malicious
manipulations. For instance, adding many keywords to a page
is a classical way to make search engines consider it relevant
to many queries. With the advent of link-based algorithms,
spammers have developed new strategies, called link spamming
[19], that aim to give some target page a high score. For
instance, Gyongyi and Garcia-Molina [18] showed various
linking strategies that improve the PageRank score of a page.
They justified the presence of link farms with patterns in
which every page links to one single page. Baeza-Yates,
Castillo and Lopez p] also showed that collusion is a good
way to improve PageRank.

In order to fight such malicious manipulations, which degrade
search engines' results and deceive web surfers, various
techniques have been developed. We refer to [11] for a
detailed survey of this subject. Each spam detection algorithm
focuses on a particular aspect of spam pages. Content analysis
(see [25] for instance) is the main tool to detect deceptive
keywords. Some simple heuristics p] may be enough to detect
the coarsest link-spam techniques, but more evolved graph
algorithms like clique detection [27], SpamRank [8] or
Truncated PageRank j\ have also been developed to fight link
spamming. As web spammers adapt to detection algorithms,
machine learning techniques [IB] try to discover the currently
predominant spam strategies and to adapt to their evolution.
Another direction of research concerns the propagation of
trust through the web graph with the TrustRank algorithm [20]
and its variants, or the propagation of distrust through a
reversed web graph with the AntiTrustRank algorithm [22].



In this paper, we develop a new link spam detection and
PageRank demotion algorithm called MaxRank. As in [20, 22],
we start with a seed of hand-picked trusted and spam pages.
We define the MaxRank of a page as the frequency with which
this page is visited by a random surfer minimizing an average
cost per time unit. On a given page, the random surfer selects
a set of hyperlinks and clicks with uniform probability on any
of these hyperlinks. The cost function penalizes spam pages
and hyperlink removals. The goal is to determine an optimal
hyperlink deletion policy. The features of MaxRank are based
on PageRank optimization [3, 24, 14, 21, 13] and more
particularly on the results of [15]. Those papers have shown
that the problem of optimizing the PageRank of a set of pages
by controlling some hyperlinks can be solved by Markov
decision process algorithms. In these works, the optimization
of PageRank was considered from a webmaster's point of view
whereas here, we take the search engine's point of view. We
show that the Markov decision process defining MaxRank is
solvable in polynomial time (Proposition 2), because the
polytopes of occupation measures admit efficient (polynomial
time) separation oracles. The invariant measure of the Markov
decision process, the MaxRank vector, is interpreted as a
modified PageRank vector, used to sort web pages instead of
the usual PageRank vector. The solution of the ergodic dynamic
programming equation, called the bias vector, is unique up to
an additive constant and is interpreted as a measure of the
"spamicity" of each page, used to detect spam pages.

We give a scalable algorithm for MaxRank computation
that allowed us to perform numerical experiments on the
WEBSPAM-UK2007 dataset [1]. We show that our algorithm
outperforms both TrustRank and AntiTrustRank for
spam and nonspam page detection. As an example, on the
WEBSPAM-UK2007 dataset, for a recall of 0.8 in the spam
detection problem, MaxRank has a precision of 0.87 while
TrustRank has a precision of 0.30 and AntiTrustRank a
precision of 0.13.

2. THE PAGERANK AND (ANTI-)TRUSTRANK ALGORITHMS

We first recall the basic elements of the Google PageRank
computation, see [10] and [23] for more information. We call
web graph the directed graph with a node per web page and
an arc from page i to page j if page i contains a hyperlink
to page j. We identify the set of pages with $[n] := \{1, \dots, n\}$.

Let $D_i$ denote the number of hyperlinks contained in page i.
Assume first that $D_i \geq 1$ for all $i \in [n]$, meaning that every
page has at least one outlink. Then, we construct the $n \times n$
stochastic matrix S, which is such that

$$S_{i,j} = \begin{cases} D_i^{-1} & \text{if page } j \text{ is pointed to from page } i \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

We also fix a row vector $z \in \mathbb{R}_+^n$, the zapping or teleportation
vector, which must be stochastic (so, $\sum_{j \in [n]} z_j = 1$), together
with a damping factor $\alpha \in [0, 1]$ (typically $\alpha = 0.85$)
and define the new stochastic matrix

$$P = \alpha S + (1 - \alpha) e z$$

where e is the (column) vector in $\mathbb{R}^n$ with all entries equal
to 1.

When some page i has no outlink, $D_i = 0$, and so the entries
of the ith row of the matrix S cannot be defined according
to (1). Then, we set $S_{i,j} := z_j$.

Consider now a Markov chain $(X_t)_{t \geq 0}$ with transition matrix P,
so that for all $i, j \in [n]$, $\mathbb{P}(X_{t+1} = j \mid X_t = i) = P_{i,j}$.
Then, $X_t$ represents the position of a websurfer at time t:
when at page i, the websurfer continues his current exploration
of the web with probability $\alpha$ and moves to the next
page by following the links included in page i, as above, or
with probability $1 - \alpha$, stops his current exploration and
then teleports to page j with probability $z_j$.

The PageRank $\pi$ is defined as the invariant measure of the
Markov chain $(X_t)_{t \geq 0}$ representing the behavior of the
websurfer. This invariant measure is unique if $\alpha < 1$ or if P is
irreducible.
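The definition above can be illustrated by a small power iteration. The following is only a sketch under stated assumptions: the function name `pagerank`, the adjacency-list input `out_links` (with distinct targets per page) and the stopping rule on the L1 difference are illustrative choices, not taken from the paper.

```python
def pagerank(out_links, z, alpha=0.85, tol=1e-12, max_iter=1000):
    """Approximate the invariant measure of P = alpha*S + (1-alpha)*e*z.

    out_links[i] lists the pages that page i links to (assumed distinct);
    z is the stochastic teleportation vector. Hypothetical helper.
    """
    n = len(out_links)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        # Teleportation term (1 - alpha) * z, valid because pi sums to 1.
        new = [(1.0 - alpha) * zj for zj in z]
        for i, targets in enumerate(out_links):
            targets = set(targets)
            if targets:
                # Row i of S puts weight 1/D_i on each outlink of page i.
                share = alpha * pi[i] / len(targets)
                for j in targets:
                    new[j] += share
            else:
                # Dangling page: row i of S is the teleportation vector z.
                for j in range(n):
                    new[j] += alpha * pi[i] * z[j]
        if sum(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi
```

For instance, `pagerank([[1, 2], [2], [0], []], [0.25] * 4)` handles a four-page graph whose last page has no outlink.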

The TrustRank algorithm [20] is a semi-automatic algorithm
designed for spam detection and PageRank demotion. It
starts with a seed of hand-picked trusted pages and then
launches the PageRank algorithm with a teleportation vector
with nonzero entries only on this seed of trusted pages.
Thus the initial trust score of trusted pages will propagate
through hyperlinks. The fundamental idea is that trusted
pages link to trusted pages and spam pages are linked to
by spam pages [12]. Each coordinate of the vector obtained
this way gives the TrustRank score of the associated page.
Pages with a high TrustRank score are trusted and considered
to be nonspam; pages with a small TrustRank score are
untrusted and considered to be spam. The boundary between
trusted and untrusted pages is an arbitrary threshold
in TrustRank. As it is a variant of PageRank, the TrustRank
score can also be used as such to rank pages instead of the
usual PageRank. This is called PageRank demotion, where
untrusted pages get a negative promotion of their score.

The AntiTrustRank algorithm [22] follows the same idea as
TrustRank but it uses a seed of spam pages instead of a
seed of trusted pages and launches the PageRank algorithm
using reversed arcs, i.e., it considers a reversed web graph
where there is an arc from node i to node j if page j points
to page i. Pages with a high AntiTrustRank are then considered
to be spam pages.
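Both algorithms can be sketched as PageRank runs with a seed-supported teleportation vector. The code below is a minimal illustration, not the reference implementations of [20] or [22]: the helper names, the uniform weights on the seed and the fixed iteration count are assumptions made for the example.

```python
def pagerank(out_links, z, alpha=0.85, iters=200):
    """Fixed number of power iterations for pi = pi (alpha*S + (1-alpha)*e*z)."""
    n = len(out_links)
    pi = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - alpha) * zj for zj in z]
        for i, targets in enumerate(out_links):
            if targets:
                for j in targets:
                    new[j] += alpha * pi[i] / len(targets)
            else:
                # Dangling page: jump according to z.
                for j in range(n):
                    new[j] += alpha * pi[i] * z[j]
        pi = new
    return pi

def seed_vector(n, seed):
    """Teleportation vector uniform on the seed pages (assumed weighting)."""
    return [1.0 / len(seed) if i in seed else 0.0 for i in range(n)]

def reverse_graph(out_links):
    """Arc i -> j in the reversed graph whenever page j points to page i."""
    rev = [[] for _ in out_links]
    for i, targets in enumerate(out_links):
        for j in targets:
            rev[j].append(i)
    return rev

def trustrank(out_links, trusted, alpha=0.85):
    return pagerank(out_links, seed_vector(len(out_links), trusted), alpha)

def antitrustrank(out_links, spam, alpha=0.85):
    return pagerank(reverse_graph(out_links),
                    seed_vector(len(out_links), spam), alpha)

# Toy graph: pages 0-2 form a trusted cluster, pages 3-4 a two-page link farm.
toy = [[1], [0, 2], [0], [4], [3]]
trust = trustrank(toy, trusted={0})
anti = antitrustrank(toy, spam={4})
```

On this toy graph the farm pages receive essentially no trust, while distrust stays confined to the farm, which is the behaviour the two paragraphs above describe.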

3. WELL-DESCRIBED MARKOV DECISION 
PROCESSES 

The definition of the MaxRank relies on the notion of Markov 
Decision Processes that we recall now. 

A finite Markov decision process is a 4-tuple $(I, (A_i)_{i \in I}, p, c)$
where I is a finite set called the state space; for all $i \in I$,
$A_i$ is the finite set of admissible actions in state i;
$p : I \times \cup_{i \in I}(\{i\} \times A_i) \to \mathbb{R}_+$ is the transition law, so that $p(j|i, a)$
is the probability to go to state j from state i when action
$a \in A_i$ is selected; and $c : \cup_{i \in I}(\{i\} \times A_i) \to \mathbb{R}$ is the cost
function, so that c(i, a) is the instantaneous cost when action
a is selected in state i.

Let $X_t \in I$ denote the state of the system at the discrete
time $t \geq 0$. A deterministic control strategy $\nu$ is a sequence of
actions $(\nu_t)_{t \geq 0}$ such that for all $t \geq 0$, $\nu_t$ is a function of the
history $h_t = (X_0, \nu_0, \dots, X_{t-1}, \nu_{t-1}, X_t)$ and $\nu_t \in A_{X_t}$. Of
course, $\mathbb{P}(X_{t+1} = j \mid X_t, \nu_t) = p(j|X_t, \nu_t)$, $\forall j \in I$, $\forall t \geq 0$.
More generally, we may consider randomized strategies $\nu$
where $\nu_t$ is a probability measure on $A_{X_t}$. A strategy $\nu$ is
stationary (feedback) if there exists a function $\bar{\nu}$ such that
for all $t \geq 0$, $\nu_t(h_t) = \bar{\nu}(X_t)$.



Given an initial distribution $\mu$ representing the law of $X_0$,
the average cost infinite horizon Markov decision problem,
also called ergodic control problem, consists in maximizing

$$\liminf_{T \to +\infty} \frac{1}{T} \mathbb{E}\Big( \sum_{t=0}^{T-1} c(X_t, \nu_t) \Big) \qquad (2)$$

where the maximum is taken over the set of randomized control
strategies $\nu$. Indeed, the supremum is the same if it is
taken only over the set of randomized (or even deterministic)
stationary feedback strategies (Theorem 9.1.8 in [26] for
instance).

A Markov decision process is unichain if the transition matrix
corresponding to every stationary policy has a single
recurrent class. Otherwise it is multichain. When the problem
is unichain, its value does not depend on the initial
distribution whereas when it is not, one may consider a vector
$(g_i)_{i \in I}$ where $g_i$ represents the value of problem (2)
when starting from state i, meaning that the law of $X_0$ is
the Dirac measure at point i.

We shall need an extension of the formalism of Markov decision
processes in which the actions are implicitly described.
This extension, developed in [15], relies on the theory of
Grötschel, Lovász and Schrijver [17] of linear programming
over polyhedra given by separation oracles.

Definition 1 (Def. 6.2.2 of [17]). We say that a polyhedron
B has facet-complexity at most $\phi$ if there exists
a system of inequalities with rational coefficients that has
solution set B and such that the encoding length of each
inequality of the system (the sum of the number of bits of the
rational numbers appearing as coefficients in this inequality)
is at most $\phi$.

A well-described polyhedron is a triple $(B; n, \phi)$ where
$B \subseteq \mathbb{R}^n$ is a polyhedron with facet-complexity at most $\phi$. The
encoding length of B is by definition $n + \phi$.

Definition 2 (Prob. (2.1.4) of [17]). A strong separation
oracle for a set K is an algorithm that solves the
following problem: given a vector y, decide whether $y \in K$
or not and if not, find a hyperplane that separates y from K;
i.e., find a vector u such that $u^T y > \max\{u^T x, x \in K\}$.

Inspired by Definition 1, we introduced the following notion.



Definition 3 ([15]). A finite Markov decision process
$(I, (A_i)_{i \in I}, p, c)$ is well-described if for every state $i \in I$, we
have $A_i \subseteq \mathbb{R}^{L_i}$ for some $L_i \in \mathbb{N}$, if there exists $\phi \in \mathbb{N}$ such
that the convex hull of every action set $A_i$ is a well-described
polyhedron $(B_i; L_i, \phi)$ with a polynomial time strong separation
oracle, and if the costs and transition probabilities
satisfy $c(i, a) = \sum_{l \in [L_i]} a_l C_i^l$ and $p(j|i, a) = \sum_{l \in [L_i]} a_l Q_{i,j}^l$,
$\forall i, j \in I$, $\forall a \in A_i$, where $C_i^l$ and $Q_{i,j}^l$ are given rational
numbers, for $i, j \in I$ and $l \in [L_i]$.

The encoding length of a well-described Markov decision
process is by definition the sum of the encoding lengths of
the rational numbers $Q_{i,j}^l$ and $C_i^l$ and of the well-described
polyhedra $B_i$.


The interest of well-described Markov decision processes is
that, even if they have a number of actions exponential in
the size of the problem, Theorem 3 in [15] shows that the
average cost infinite horizon problem for a well-described
(multichain) Markov decision process can be solved in time
polynomial in the input length.



4. THE MAXRANK ALGORITHM 

In this section, we define the MaxRank algorithm. It is based
on our earlier work on PageRank optimization [15]. In [15],
we considered the problem of optimizing the PageRank of a
given website from a webmaster's point of view, that is with
some controlled hyperlinks and design constraints. Here,
we take the search engine's point of view. Hence, for every
hyperlink of the web, we can choose to take it into account
or not: our goal is to ignore spam links while keeping trusted
links active in the determination of the ranking.

As in TrustRank [20] and AntiTrustRank [22] , we start with 
a seed of trusted pages and known spam pages. The basic 
idea is to minimize the sum of PageRank scores of spam 
pages and maximize the sum of PageRank scores of non- 
spam pages, by allowing us to remove some hyperlinks when 
computing the PageRank. However, if we do so, the opti- 
mal strategy simply consists in isolating spam pages from 
trusted pages: there is then no hope to detect other spam 
and nonspam pages. Thus, we add a penalty when a hy- 
perlink is removed, so that "spamicity" can still propagate 
through (removed or not removed) hyperlinks. Finally, we 
also control the teleportation vector in order to penalize fur- 
ther pages that we suspect to be spam pages. 

We model this by a controlled random walk on the web
graph, in which the hyperlinks can be removed. Each time
the random surfer goes to a spam page, he incurs a positive
cost; each time he goes to a trusted page, he incurs a negative
cost. When the status of the page is unknown, no cost is
incurred. In addition to this a priori cost, he gets a penalty
for each hyperlink removed. As for PageRank, the random
surfer teleports with probability $1 - \alpha$ at every time step;
however, in this framework, he chooses the set of pages to which
he wants to teleport.

Let $\mathcal{F}_x$ be the set of pages pointed to by x in the original graph
and $D_x$ be the degree of x. An action consists in determining
$J \subseteq \mathcal{F}_x$, the set of hyperlinks that remain, and $I \subseteq [n]$, the
set of pages to which the surfer may teleport. We shall
restrict I to have a cardinality equal to $N \leq n$. Then, the
probability of transition from page x to page y is

$$p(y|x, I, J) = \alpha \nu_y(I, J) + (1 - \alpha) z_y(I)$$

where the teleportation vector and the hyperlink click
probability distribution are given by

$$z_y(I) = \begin{cases} N^{-1} & \text{if } y \in I \\ 0 & \text{otherwise} \end{cases}$$

$$\nu_y(I, J) = \begin{cases} z_y(I) & \text{if } |J| = 0 \\ |J|^{-1} & \text{if } y \in J \\ 0 & \text{otherwise} \end{cases}$$

The cost at page x is given by

$$c(x, I, J) = c'_x + \gamma \frac{D_x - |J|}{D_x}$$

$c'_x$ is the a priori cost of page x. This a priori cost should
be positive for known spam pages and negative for trusted
pages. $D_x$ is the degree of x in the web graph and $\gamma > 0$
is a penalty factor. The penalty $\gamma \frac{D_x - |J|}{D_x}$ is proportional to
the number of hyperlinks removed.
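As an illustration of these formulas, the sketch below evaluates the transition row and the cost for one action (I, J). The function name and argument layout are hypothetical, and the code assumes that page x has at least one outlink so that the penalty term is defined.

```python
def action_transition_and_cost(x, I, J, out_links, n, a_priori_cost,
                               gamma, alpha=0.85):
    """Return (p, c) with p[y] = p(y | x, I, J) and c = c(x, I, J).

    Hypothetical helper; assumes D_x >= 1 and J is a subset of F_x.
    """
    F_x = set(out_links[x])
    I, J = set(I), set(J)
    assert J <= F_x, "only existing hyperlinks can be kept"
    D_x, N = len(F_x), len(I)
    # Teleportation vector: uniform on the N pages of I.
    z = [1.0 / N if y in I else 0.0 for y in range(n)]
    if J:
        # Uniform click on the hyperlinks that remain.
        nu = [1.0 / len(J) if y in J else 0.0 for y in range(n)]
    else:
        # All hyperlinks removed: the surfer teleports instead.
        nu = z
    p = [alpha * nu[y] + (1.0 - alpha) * z[y] for y in range(n)]
    c = a_priori_cost[x] + gamma * (D_x - len(J)) / D_x
    return p, c

# Page 0 links to pages 1, 2, 3; keep only the link to page 1.
p, c = action_transition_and_cost(
    x=0, I={0, 1}, J={1}, out_links=[[1, 2, 3], [0], [0], [0]],
    n=4, a_priori_cost=[1.0, 0.0, 0.0, 0.0], gamma=3.0)
```

Here two of the three hyperlinks are removed, so the penalty is two thirds of the penalty factor.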

We study the following ergodic control problem:

$$\inf_{(I_t)_{t \geq 0}, (J_t)_{t \geq 0}} \limsup_{T \to +\infty} \frac{1}{T} \mathbb{E}\Big( \sum_{t=0}^{T-1} c(X_t, I_t, J_t) \Big), \qquad (3)$$

where an admissible control consists in selecting, at each
time step t, a subset of pages $I_t \subseteq [n]$ with $|I_t| = N$ to
which teleportation is permitted, and a subset $J_t \subseteq \mathcal{F}_{X_t}$ of
the set of hyperlinks in the currently visited page $X_t$.

The following proposition gives an alternative formulation
of Problem (3) that we will then show to be well-described.



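Proposition 1. Fix $N \in \mathbb{N}$ and let

$$Z = \Big\{ z \in \mathbb{R}^n \;\Big|\; \sum_{i \in [n]} z_i = 1, \; 0 \leq z_i \leq \frac{1}{N} \Big\}.$$

Let $\mathcal{F}_x$ be the set of pages pointed to by x in the original graph
and $D_x$ be the degree of x. Let $\mathcal{P}_x$ be the polyhedron defined
as the set of vectors $(\sigma, \nu) \in \mathbb{R}^{D_x + 1} \times \mathbb{R}^n$ such that there
exists $w \in \mathbb{R}^{(D_x + 1) \times n}$ verifying

$$\sum_{d=0}^{D_x} \sigma_d = 1 \qquad (4a)$$
$$\sigma_d \geq 0, \quad \forall d \in \{0, \dots, D_x\} \qquad (4b)$$
$$\nu_j = \sum_{d=0}^{D_x} w_j^d, \quad \forall j \in [n] \qquad (4c)$$
$$\sum_{j \in [n]} w_j^d = \sigma_d, \quad \forall d \in \{0, \dots, D_x\} \qquad (4d)$$
$$0 \leq w_j^0 \leq \frac{\sigma_0}{N}, \quad \forall j \in [n] \qquad (4e)$$
$$w_j^d = 0, \quad \forall j \notin \mathcal{F}_x, \; \forall d \in \{1, \dots, D_x\} \qquad (4f)$$
$$0 \leq w_j^d \leq \frac{\sigma_d}{d}, \quad \forall j \in \mathcal{F}_x, \; \forall d \in \{1, \dots, D_x\} \qquad (4g)$$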

Then Problem (3) is equivalent to the following ergodic control
problem

$$\inf_{(\sigma_t, \nu_t, z_t)_{t \geq 0}} \limsup_{T \to +\infty} \frac{1}{T} \mathbb{E}\Big( \sum_{t=0}^{T-1} c(X_t, \sigma_t, \nu_t, z_t) \Big) \qquad (5)$$

where the cost is defined as

$$c(x, \sigma, \nu, z) = c'_x + \gamma \frac{D_x - \sum_{d=0}^{D_x} d\, \sigma_d}{D_x}$$

and the transitions are

$$p(y|x, \sigma, \nu, z) = \alpha \nu_y + (1 - \alpha) z_y.$$

The admissible controls verify for all t, $(\sigma_t, \nu_t) \in \text{extr}(\mathcal{P}_{X_t})$
(the set of extreme points of the polytope) and $z_t \in \text{extr}(Z)$.

Indeed, to each action $(\sigma, \nu, z)$ of Problem (5) corresponds a
unique action I, J of Problem (3) and vice versa. Moreover,
the respective transitions and costs are equal.



Proof. Fix a page x in [n]. The extreme points of Z are
the vectors of $\mathbb{R}^n$ with N coordinates equal to $\frac{1}{N}$ and the
other ones equal to 0. Hence $z(I) \in \text{extr}(Z)$ and for each
extreme point z' of Z there exists $I \subseteq [n]$ such that |I| = N
and z' = z(I). We shall also describe the set of extreme
points of $\mathcal{P}_x$.

From the theory of disjunctive linear programming [6], we
can see that the polytope $\mathcal{K} = \{\nu \mid (\sigma, \nu) \in \mathcal{P}_x\}$ is the convex
hull of the union of $D_x + 1$ polytopes that we will denote
$\mathcal{K}_d$, $d \in \{0, 1, \dots, D_x\}$. If d = 0, then $\mathcal{K}_0 = Z$. If d > 0,

$$\mathcal{K}_d = \Big\{ \nu \in \mathbb{R}^n \;\Big|\; \sum_{j \in [n]} \nu_j = 1, \; 0 \leq \nu_j \leq \frac{1}{d} \; \forall j \in \mathcal{F}_x, \; \nu_j = 0 \; \forall j \notin \mathcal{F}_x \Big\}.$$

Let $(\sigma, \nu)$ be an extreme point of $\mathcal{P}_x$. By Corollary 2.1.2-ii)
in [6], there exists d* such that $\sigma_{d^*} = 1$ and $\nu$ is an extreme
point of $\mathcal{K}$. As $\sigma_{d^*} = 1$ and $\sigma_d = 0$ for $d \neq d^*$, we conclude
that $\nu$ is also an extreme point of $\mathcal{K}_{d^*} = \mathcal{P}_x \cap \{\sigma \mid \sigma_{d^*} = 1\}$.
If d* = 0, $\nu \in \text{extr}(Z)$ and if d* > 0, the extreme points of
$\mathcal{K}_{d^*}$ correspond exactly to the vectors with d* coordinates
in $\mathcal{F}_x$ equal to $\frac{1}{d^*}$ and the other ones equal to 0. Hence there
exist $I \subseteq [n]$ and $J \subseteq \mathcal{F}_x$ such that |I| = N, |J| = d* and
$\nu = \nu(I, J)$.

Conversely, fix I and J and let $\sigma(J)$ be such that $\sigma_d(J) = 1$
if and only if d = |J|. Then $(\sigma(J), \nu(I, J))$ is an extreme
point of $\mathcal{P}_x$ by Corollary 2.1.2-i) in [6].

Finally the costs are the same since $|J| = \sum_{d=0}^{D_x} d\, \sigma_d(J)$. □



Proposition 2. If $\alpha$, $\gamma$ and $c'_i$, $i \in [n]$ are rational numbers,
then the ergodic control problem (5) is the average cost
infinite horizon problem for a well-described Markov decision
process and it is polynomial time solvable.

Proof. Clearly, the process described is a Markov decision
process. As the polytopes $\mathcal{P}_i$, $i \in [n]$ and Z are described
by a polynomial number of inequalities with at most
n + 1 terms in each, they are well-described. Indeed, the
separation oracle consisting simply in testing each inequality
terminates in polynomial time. The cost and transitions
are linear functions on those polytopes with rational coefficients
since $\alpha$, $\gamma$ and $c'_i$, $i \in [n]$ are rational numbers. Thus the
Markov decision process is well-described. By Theorem 3
in [15], Problem (5) is thus solvable in polynomial time. □
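The oracle used in this proof, which simply tests each inequality in turn, can be sketched as follows for a polytope given by an explicit list of inequalities. The representation as pairs (a, b) meaning a·x ≤ b and the helper names are assumptions made for this illustration; only the teleportation polytope Z is built here, since the polytopes $\mathcal{P}_i$ involve the auxiliary variables w.

```python
from fractions import Fraction

def separation_oracle(inequalities, y):
    """Strong separation oracle for K = {x : a.x <= b for all (a, b)}.

    Returns None when y belongs to K; otherwise returns the normal a of a
    violated inequality, so that a.y > b >= max{a.x : x in K}.
    """
    for a, b in inequalities:
        if sum(ai * yi for ai, yi in zip(a, y)) > b:
            return a
    return None

def teleportation_polytope(n, N):
    """Inequalities describing Z = {z : sum z_i = 1, 0 <= z_i <= 1/N}."""
    # The equality sum z_i = 1 is written as two opposite inequalities.
    ineqs = [([1] * n, 1), ([-1] * n, -1)]
    for i in range(n):
        unit = [0] * n
        unit[i] = 1
        ineqs.append((unit, Fraction(1, N)))       # z_i <= 1/N
        ineqs.append(([-u for u in unit], 0))      # z_i >= 0
    return ineqs

Z = teleportation_polytope(n=3, N=2)
inside = separation_oracle(Z, [Fraction(1, 2), Fraction(1, 2), 0])
cut = separation_oracle(Z, [Fraction(1), 0, 0])
```

The running time is linear in the number of inequalities times their length, which is the polynomial bound invoked in the proof. Exact rational arithmetic is used to match the rationality assumption of the proposition.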



Proposition 3. The dynamic programming equation

$$v_i + \lambda = \min_{(\sigma, \nu) \in \mathcal{P}_i, \; z \in Z} \; c'_i + \gamma \frac{D_i - \sum_{d=0}^{D_i} d\, \sigma_d}{D_i} + \sum_{j \in [n]} (\alpha \nu_j + (1 - \alpha) z_j) v_j, \quad \forall i \in [n] \qquad (6)$$

has a solution $v \in \mathbb{R}^n$ and $\lambda \in \mathbb{R}$. The constant $\lambda$ is unique
and is the value of problem (3). An optimal strategy is
obtained by selecting for each state i, $(\sigma, \nu) \in \mathcal{P}_i$ and $z \in Z$
attaining the minimum in Equation (6). The function v is called the bias.

Proof. Theorem 8.4.3 in [Put94] applied to the unichain
ergodic control problem (5) implies the result of the proposition
but with $\mathcal{P}_i$ replaced by extr($\mathcal{P}_i$). But as the expression
which is minimized is affine, using $\mathcal{P}_i$ or extr($\mathcal{P}_i$) yields the
same solution. Proposition 1 gives the equivalence between
(3) and (5). □



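Proposition 4. Let T be the dynamic programming operator
$\mathbb{R}^n \to \mathbb{R}^n$ defined by

$$T_i(v) = \min_{(\sigma, \nu) \in \mathcal{P}_i} \; c'_i + \gamma \frac{D_i - \sum_{d=0}^{D_i} d\, \sigma_d}{D_i} + \alpha \sum_{j \in [n]} \nu_j v_j, \quad \forall i \in [n].$$

The map T is $\alpha$-contracting in the sup norm and its fixed
point v, which is unique, is such that $(v, (1 - \alpha) \min_{z \in Z} z \cdot v)$
is solution of the ergodic dynamic programming equation (6).

Proof. The set $\{\nu \text{ s.t. } (\sigma, \nu) \in \mathcal{P}_i\}$ is a set of probability
measures so $\lambda \in \mathbb{R} \Rightarrow T(v + \lambda) = T(v) + \alpha \lambda$ and $v \geq w \Rightarrow T(v) \geq T(w)$.
This implies that T is $\alpha$-contracting. Let v
be its fixed point. For all $i \in [n]$,

$$v_i + (1 - \alpha) \min_{z \in Z} z \cdot v = \min_{(\sigma, \nu) \in \mathcal{P}_i, \; z \in Z} \; c'_i + \gamma \frac{D_i - \sum_{d=0}^{D_i} d\, \sigma_d}{D_i} + \sum_{j \in [n]} (\alpha \nu_j + (1 - \alpha) z_j) v_j.$$

We get equation (6) with constant $(1 - \alpha) \min_{z \in Z} z \cdot v$. □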



We can then solve the dynamic programming equation (6)
and so the ergodic control problem (3) by value iteration
(outer loop of Algorithm 1).

The algorithm starts with an initial potential function v,
scans the pages repeatedly and updates $v_i$ when i is the
current page according to $v_i \leftarrow T_i(v)$ until convergence is
reached. Then $(v, (1 - \alpha) \min_{z \in Z} z \cdot v)$ is solution of the
ergodic dynamic programming equation (6) and an optimal
linkage strategy is recovered by selecting the minimizing
$(\sigma, \nu)$ at each page. An optimal teleportation vector is
recovered by selecting a minimizing z in $\min_{z \in Z} z \cdot v$.

Thanks to the damping factor $\alpha$, the iteration can be seen
to be $\alpha$-contracting if the pages are scanned in a cyclic order.
Thus the number of iterations needed to reach a given
precision is independent of the dimension of the web graph.



For the evaluation of the dynamic programming operator
T at a page i (inner for-loop of Algorithm 1), we remark
that for a fixed 0-1 valued $\sigma$, that is for a fixed number
of removed hyperlinks, the operator $T_i$ minimizes a linear
function on a hypercube, which reduces essentially to a sort.
Any extreme point $(\sigma, \nu)$ of $\mathcal{P}_i$ necessarily verifies that $\sigma$ is
0-1 valued. Thus we just need to choose the best value of
$\sigma$ among the $D_i + 1$ possibilities.

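Algorithm 1 MaxRank algorithm

  Initialization: $v \in \mathbb{R}^n$
  while $\|v - T(v)\|_\infty > \epsilon$ do
    Sort $(v_l)_{l \in [n]}$ in increasing order and let $\phi : [n] \to [n]$
      be the sort function so that $v_{\phi(1)} \leq \dots \leq v_{\phi(n)}$.
    $\lambda \leftarrow \frac{1}{N} \sum_{j=1}^{N} v_{\phi(j)}$
    for i from 1 to n do
      Sort $(v_j)_{j \in \mathcal{F}_i}$ in increasing order and let $\psi : \{1, \dots, |\mathcal{F}_i|\} \to \mathcal{F}_i$
        be such that $v_{\psi(1)} \leq \dots \leq v_{\psi(|\mathcal{F}_i|)}$.
      $w^0 \leftarrow c'_i + \gamma + \alpha \lambda$
      for d from 1 to $D_i$ do
        $w^d \leftarrow c'_i + \gamma \frac{D_i - d}{D_i} + \frac{\alpha}{d} \sum_{j=1}^{d} v_{\psi(j)}$
      end for
      $v_i \leftarrow T_i(v) = \min_{d \in \{0, 1, \dots, D_i\}} w^d$
    end for
  end while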
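A compact Python rendering of this value iteration may help to fix ideas. It is a sketch under several assumptions that are not in the paper: the function name `maxrank_bias`, the in-place sweep change used as a stopping test in place of the sup-norm residual, and the convention that a page without outlinks pays no removal penalty.

```python
def maxrank_bias(out_links, a_priori_cost, gamma, N, alpha=0.85,
                 eps=1e-10, max_sweeps=10000):
    """Value iteration sketch returning (v, ergodic_constant).

    v approximates the fixed point of T and the constant is
    (1 - alpha) times the mean of the N smallest entries of v.
    """
    n = len(out_links)
    v = [0.0] * n
    for _ in range(max_sweeps):
        # Best teleportation set: the N pages with smallest potential.
        lam = sum(sorted(v)[:N]) / N
        change = 0.0
        for i in range(n):
            F_i = set(out_links[i])
            D_i = len(F_i)
            # d = 0: every hyperlink removed, the surfer teleports.
            # Assumption: no penalty when the page has no outlink.
            best = a_priori_cost[i] + (gamma if D_i else 0.0) + alpha * lam
            prefix = 0.0
            # d >= 1: keep the d outlinks with smallest potential.
            for d, vj in enumerate(sorted(v[j] for j in F_i), start=1):
                prefix += vj
                cand = (a_priori_cost[i] + gamma * (D_i - d) / D_i
                        + alpha * prefix / d)
                best = min(best, cand)
            change = max(change, abs(best - v[i]))
            v[i] = best
        if change < eps:
            break
    lam = sum(sorted(v)[:N]) / N
    return v, (1.0 - alpha) * lam

# Two pages linking to each other; page 0 is known spam.
v, value = maxrank_bias([[1], [0]], a_priori_cost=[1.0, 0.0],
                        gamma=100.0, N=2, alpha=0.5)
```

With such a large penalty factor no link is removed in this toy example, so v solves the linear system $v_0 = 1 + 0.5 v_1$, $v_1 = 0.5 v_0$, that is v = (4/3, 2/3).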



This very efficient algorithm is highly scalable: we used it for
our experimental results on a large dataset (Section 5).

The following proposition shows that if $\gamma$ is too big, then
the optimal link removal strategy is trivial. It also gives an
interpretation of the bias vector in terms of the number of visits
of spam pages.



Proposition 5. If $\gamma > \frac{2\alpha}{1 - \alpha} \|c'\|_\infty$, then no link should
be removed. Moreover if in addition, $c'_i = 1$ when i is a
spam page and 0 otherwise, then the ith coordinate of the fixed
point of operator T in Proposition 4 is equal to the expected
mean number of spam pages visited before teleportation when
starting a walk from page i.

Proof. Proposition 4 gives a normalization of the bias
vector such that it cannot take any value bigger than $\frac{\|c'\|_\infty}{1 - \alpha}$,
as it is the case for PageRank optimization [14, 15].

Now fix $i \in [n]$. Let $\nu^0 = \nu(I, \mathcal{F}_i)$ and $\nu = \nu(I, J)$ for $J \subseteq \mathcal{F}_i$.
If $J = \emptyset$, then

$$\alpha |\nu^0 v - \nu v| \leq \alpha |\nu^0 v| + \alpha |\nu v| \leq \frac{2\alpha}{1 - \alpha} \|c'\|_\infty \frac{D_i - |\emptyset|}{D_i}.$$

If $J \neq \emptyset$, then

$$\alpha |\nu^0 v - \nu v| \leq \alpha \Big( \frac{1}{|J|} - \frac{1}{D_i} \Big) \Big| \sum_{j \in J} v_j \Big| + \frac{\alpha}{D_i} \Big| \sum_{j \in \mathcal{F}_i \setminus J} v_j \Big| \leq \frac{2\alpha}{1 - \alpha} \|c'\|_\infty \frac{D_i - |J|}{D_i} \leq \gamma \frac{D_i - |J|}{D_i}.$$

This proves that choosing $J = \mathcal{F}_i$ is always the best strategy
when $\gamma > \frac{2\alpha}{1 - \alpha} \|c'\|_\infty$.

When no link is removed and c' is defined as in the proposition,
we are in the framework of [14], where it is shown that
the ith coordinate of the fixed point of the operator T is equal
to the expected mean number of visits before teleportation
when starting a walk from page i. □



5. SPAM DETECTION AND PAGERANK 
DEMOTION 

We performed numerical experiments on the WEBSPAM-UK2007
dataset [1]. This is a crawl of the .uk websites with
n = 105,896,555 pages performed in 2007, associated with
lists of hosts classified as spam, nonspam and borderline.
There is a training dataset for tuning the algorithm
and a test dataset for evaluating its performance.

We took $\gamma = 4$, $\alpha = 0.85$, $N = 0.89n$, $c'_i = 1$ if i is a spam
page of the training dataset, $c'_i = -0.2$ if i is a nonspam
page of the training dataset and $c'_i = 0$ otherwise. Then we
obtained a MaxRank score and the associated bias vector.
We also computed TrustRank and AntiTrustRank with the
training dataset as the seed sets. We used the WebGraph
framework [9], so that we could manage the computation on
a personal computer with four Intel Xeon CPUs at 2.98 GHz
and 8 GB RAM. We coded the algorithm in a parallel fashion
thanks to the OpenMP library. Computation took around
6 minutes for each evaluation of the dynamic programming
operator of Proposition 4 and 6 hours for 60 such iterations
(precision on the objective $\alpha^{60} \leq 6 \cdot 10^{-5}$). By comparison,
PageRank computation with the same precision required 1.3
hours on the same computer, which is of the same order of
magnitude.
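The precision figure quoted above can be checked with a few lines of arithmetic. This is only a sanity check of the stated bound, assuming the error after k iterations is measured by $\alpha^k$ as the contraction property suggests; the variable names are illustrative.

```python
import math

alpha = 0.85
target = 6e-5

# Contraction factor accumulated over 60 evaluations of the operator.
error_after_60 = alpha ** 60

# Smallest k with alpha**k <= target.
iterations_needed = math.ceil(math.log(target) / math.log(alpha))

# Wall-clock estimate at about 6 minutes per evaluation, in hours.
estimated_hours = iterations_needed * 6 / 60
```

The estimate of 60 iterations at roughly 6 minutes each matches the 6 hours reported in the text.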

Figure 1 gives the values taken by the bias vector. Figure 2
compares the precision and recall of PageRank, TrustRank,
AntiTrustRank and MaxRank bias for spam or nonspam
detection. Precision and recall are the usual measures of 
the quality of an information retrieval algorithm 03. Pre- 
cision is the probability that a randomly selected retrieved 
document is relevant. Recall is the probability that a ran- 
domly selected relevant document is retrieved in a search. 
These values were obtained using the training and the test 
sets. Figure 3 compares TrustRank and MaxRank scores for
PageRank demotion.
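For concreteness, precision and recall at a given score threshold can be computed as in the sketch below. The function name, the rule that pages with a score at or above the threshold are retrieved, and the value returned when nothing is retrieved are assumptions of this example, not the evaluation code used for the figures.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when pages with score >= threshold are retrieved.

    labels[i] is True when page i is relevant (for instance, labelled spam
    when the score is a spamicity measure).
    """
    retrieved = [s >= threshold for s in scores]
    true_positives = sum(1 for r, l in zip(retrieved, labels) if r and l)
    n_retrieved = sum(retrieved)
    n_relevant = sum(1 for l in labels if l)
    # Convention (assumed): an empty retrieved or relevant set gives 1.0.
    precision = true_positives / n_retrieved if n_retrieved else 1.0
    recall = true_positives / n_relevant if n_relevant else 1.0
    return precision, recall

precision, recall = precision_recall(
    scores=[0.9, 0.8, 0.3, 0.1],
    labels=[True, False, True, False],
    threshold=0.5)
```

Sweeping the threshold over the sorted scores traces a precision-recall curve of the kind compared across the four algorithms.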